Smart Bin Optimized Waste Collection Route¶
Authored by: Sachitha Sadeepa Kasthuriarachchi
Duration: 180 mins
Level: Beginner/Intermediate
Pre-requisite Skills: Python, Jupyter Notebooks, Power BI, and Data Analysis
Scenario:
In Melbourne, managing urban waste efficiently is paramount for promoting cleanliness, minimizing environmental degradation, and safeguarding public health. The introduction of smart bin technology, which facilitates real-time monitoring of waste levels, presents an innovative opportunity to overhaul the city's waste collection methodologies. This initiative focuses on refining waste collection routes by utilizing sophisticated data analytics, including real-time data on bin fill levels and precise location coordinates of smart bins. Although some of this data can be accessed through the client's open data portal, incorporating additional datasets, such as real-time traffic conditions, may necessitate partnerships with local government bodies or external data providers. The goal of this project is to streamline waste management processes in Melbourne by optimizing collection routes, thereby curtailing fuel usage, alleviating traffic congestion, and boosting overall operational efficiency. This strategic approach not only enhances service reliability but also contributes significantly to the city's environmental and economic sustainability.
User Story:
As the waste management authority in Melbourne, I aim to streamline waste collection routes for smart bins to enhance waste collection efficiency. My goal is to minimize fuel consumption, reduce traffic congestion, and cut operational costs.
Acceptance Criteria:
Efficient Route Planning: The solution should provide optimized routes for waste collection vehicles to pick up waste from smart bins across different areas of Melbourne.
Real-time Data Integration: The system should integrate real-time data from smart bins to dynamically adjust collection routes based on current fill levels and demand patterns.
Minimize Environmental Impact: The optimized routes should aim to reduce fuel consumption and carbon emissions by minimizing travel distance and time.
Cost-effectiveness: The optimized routes should minimize operational costs associated with waste collection, including labor, fuel, and vehicle maintenance.
Scalability: The solution should be scalable to accommodate the growing number of smart bins and evolving waste collection needs in Melbourne.
User-Friendly Interface: The interface should be intuitive and user-friendly, allowing waste management authorities to easily visualize and manage waste collection routes.
What this use case will teach you
At the end of this use case you will learn:
Route Optimization: Ability to develop algorithms and methodologies for optimizing waste collection routes, considering factors such as distance, time, and resource utilization.
Data Integration: Proficiency in integrating real-time data from smart bins into route planning algorithms, ensuring accurate and timely decision-making.
Environmental Impact Analysis: Capability to assess and mitigate environmental impacts of waste collection operations, including reducing fuel consumption and carbon emissions.
Cost Optimization: Skills in cost analysis and resource allocation to minimize operational expenses associated with waste collection, including labor, fuel, and vehicle maintenance.
Scalability Planning: Ability to design scalable solutions to accommodate the growing number of smart bins and evolving waste management needs in urban environments.
User Experience Design: Proficiency in designing user-friendly interfaces for waste management authorities, ensuring ease of use and efficient management of collection routes.
1. Importing Data¶
Dataset link Smart bins Argyle Square
1.1 Dataset Imported through the API¶
import requests
import pandas as pd
import os

def fetch_data(base_url, dataset, api_key, num_records=99, offset=0):
    all_records = []
    max_offset = 9900  # Maximum offset allowed by the API
    while True:
        # Maximum offset check
        if offset > max_offset:
            break
        # Create API request URL
        filters = f'{dataset}/records?limit={num_records}&offset={offset}'
        url = f'{base_url}{filters}&api_key={api_key}'
        # Issue the request
        try:
            result = requests.get(url, timeout=10)
            result.raise_for_status()
            records = result.json().get('results')
        except requests.exceptions.RequestException as e:
            raise Exception(f"API request failed: {e}")
        if records is None:
            break
        all_records.extend(records)
        if len(records) < num_records:
            break
        # Advance the offset for the next cycle
        offset += num_records
    # Collect all fetched records into a DataFrame
    df = pd.DataFrame(all_records)
    return df
API_KEY = os.environ.get("API_KEY")
print("API Key:", API_KEY)
BASE_URL = "https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/"
API Key: None
1.2 Fetching Smart Bin Sensor Data¶
This code initiates the process of gathering data for the project by accessing a specific dataset named smart-bins-argyle-square, which consists of sensor data from smart bins. The main steps involved are:
- Define Dataset Name: Assign the name of the dataset to the variable `Bin_Sensor_Data`.
- Fetch Data: Utilize the `fetch_data` function, providing it with the base URL (`BASE_URL`), the dataset name (`Bin_Sensor_Data`), and an API key (`API_KEY`) for authorization. This action retrieves the dataset from the specified source.
- Preview Data: Apply the `.head()` method to the resulting DataFrame `data` to display the first few entries. This step is crucial for a preliminary examination of the data structure, including column names and initial values.
This sequence of operations is pivotal for starting our data analysis, allowing us to understand the dataset's structure and prepare for in-depth exploration.
Bin_Sensor_Data = 'smart-bins-argyle-square'
data = fetch_data(BASE_URL, Bin_Sensor_Data, API_KEY)
data.head(3)
| time | serial | sensor_name | status_current_fill_level | bin_status | last_update | product_type | description | address | latlong | wastebasket_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-29T21:18:25+00:00 | FB1000001905AA05 | None | 14 | Low | None | CleanFLEX | Argyle - FB1000001905AA05 | Argyle Square, 13-19 Argyle Pl N, Carlton VIC ... | {'lon': 144.966267, 'lat': -37.802983} | None |
| 1 | 2021-06-30T12:17:40+00:00 | FB1000001905AA15 | Internal Bin Sensor 2 | 49 | Half Full | None | CleanFLEX | Argyle - FB1000001905AA15 | 138-146 Cardigan St, Carlton VIC 3053, Australia | {'lon': 144.9655, 'lat': -37.802017} | None |
| 2 | 2021-07-01T00:00:21+00:00 | FB1000001905AA17 | External Bin Sensor 1 | 100 | Full | None | CleanFLEX | Argyle - FB1000001905AA17 | 164 Lygon St, Carlton VIC 3053, Australia | {'lon': 144.9661, 'lat': -37.803217} | None |
2.0 Preprocessing Data¶
2.1 Identifying Missing Values in the Dataset (part 1)¶
To ensure the integrity and quality of our analysis, it's crucial to check for missing values within our dataset. The code performs the following steps:
- Calculate Missing Values: Using `data.isnull().sum()`, we calculate the total number of missing values in each column of our dataset. The method `isnull()` identifies missing values across the DataFrame, and `sum()` aggregates these by column, providing a count of missing values per column.
- Display Missing Values: We then print a statement to notify the user that we're checking for missing values, followed by printing the results stored in `missing_values`. This output gives us a clear picture of which columns contain missing data and how many missing entries each has.
Understanding where our data is incomplete is crucial for deciding on subsequent data cleaning and imputation strategies, ensuring the reliability of our analysis or model predictions.
missing_values = data.isnull().sum()
print(" Checking for the Missing Values ")
print(missing_values)
 Checking for the Missing Values
time                            0
serial                          0
sensor_name                  2889
status_current_fill_level       0
bin_status                      0
last_update                  9999
product_type                    0
description                     0
address                         0
latlong                      1808
wastebasket_size             9999
dtype: int64
2.2 Optimizing Data Types for Analysis¶
To ensure our dataset is primed for analysis, we undertake a comprehensive data type optimization process. This involves adjusting the data types of various columns to more accurately reflect their content and optimize memory usage. Here's an overview of the steps taken:
- Timestamp Conversion: Convert the `time` and `last_update` columns to `datetime` objects to facilitate time series analysis.
- String Specification: Change the `serial`, `sensor_name`, `description`, and `address` columns to the `string` data type for consistency in text data handling.
- Category Optimization: For columns like `bin_status` and `product_type`, we use the `category` data type when we have a limited set of possible values, enhancing memory efficiency and performance during grouping operations.
- Numerical Adjustments: Ensure numerical data, such as `wastebasket_size`, is correctly typed (`float`) for accurate mathematical computations.
- Geolocation Handling: Split the `latlong` column into separate `latitude` and `longitude` columns and convert them to `float` data types, dropping the original `latlong` column to simplify geographical analysis.
These adjustments are crucial for the following reasons:
- Efficiency: Optimizing data types can significantly reduce memory usage, making data manipulation more efficient.
- Accuracy: Ensuring data types match the nature of the data prevents errors during analysis.
- Convenience: Having the right data type (e.g., datetime) simplifies operations like date range filtering or time-based aggregation.
By refining our dataset's structure through these changes, we lay a solid foundation for robust and error-free data analysis.
# Let's change the datatypes to more appropriate types:
data['time'] = pd.to_datetime(data['time'])
data['serial'] = data['serial'].astype('string')
data['sensor_name'] = data['sensor_name'].astype('string')
# 'status_current_fill_level' is already an integer, no change needed.
data['bin_status'] = data['bin_status'].astype('category') # Or 'string' if it's not a finite set of statuses.
data['last_update'] = pd.to_datetime(data['last_update'])
data['product_type'] = data['product_type'].astype('category') # Or 'string' if it's not a finite set of types.
data['description'] = data['description'].astype('string')
data['address'] = data['address'].astype('string')
# 'latlong' will be split into two separate columns for latitude and longitude, and converted to float.
# data.drop('latlong', axis=1, inplace=True)

# Assuming 'wastebasket_size' refers to a numerical measurement and not a category.
data['wastebasket_size'] = data['wastebasket_size'].astype('float')  # Or 'category' if it's not numerical.

# Handle the 'latlong' column if it is a string of 'latitude,longitude'.
# (In this dataset 'latlong' actually holds dictionaries, so this branch does not
# fire; the dictionary case is handled later, in section 2.12.)
if not data.empty and 'latlong' in data.columns and data['latlong'].str.contains(',').any():
    # Split 'latlong' into two new columns 'latitude' and 'longitude' and convert them to floats.
    latlong_split = data['latlong'].str.split(',', expand=True)
    data['latitude'] = pd.to_numeric(latlong_split[0], errors='coerce')
    data['longitude'] = pd.to_numeric(latlong_split[1], errors='coerce')
    data.drop('latlong', axis=1, inplace=True)
# Display the DataFrame with new datatypes
data.dtypes
time                         datetime64[ns, UTC]
serial                            string[python]
sensor_name                       string[python]
status_current_fill_level                  int64
bin_status                              category
last_update                       datetime64[ns]
product_type                            category
description                       string[python]
address                           string[python]
latlong                                   object
wastebasket_size                         float64
dtype: object
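The memory claim above is easy to verify. The following sketch is not part of the original notebook; it compares a plain `object` column with its `category` equivalent on synthetic data mimicking the `bin_status` values:

```python
import pandas as pd

# Synthetic status column mimicking 'bin_status': few distinct values, many rows
statuses = pd.Series(["Low", "Half Full", "Full"] * 3000, dtype="object")

mem_object = statuses.memory_usage(deep=True)
mem_category = statuses.astype("category").memory_usage(deep=True)

print(f"object dtype:   {mem_object:,} bytes")
print(f"category dtype: {mem_category:,} bytes")
```

Because `category` stores each distinct label only once and keeps small integer codes per row, the savings grow with the number of rows.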
2.3 Improving the Detail of Time-Related Data¶
This segment focuses on refining our dataset's temporal data by splitting the 'time' column into distinct 'date' and 'time_only' columns. This transformation aids in detailed time-based analysis and enables more flexible data manipulation. Here are the key steps involved:
- Datetime Parsing: Initially, we ensure the 'time' column is correctly parsed into datetime format with `pd.to_datetime(data['time'], errors='coerce')`. The `errors='coerce'` parameter converts any parsing errors into NaT (Not a Time), preventing data corruption.
- Column Creation: After a successful datetime conversion, we split the datetime column:
  - The `date` column is created to store only the date component of each entry.
  - The `time_only` column is created to store only the time component.
- Error Handling: In cases where datetime parsing results in all null values (indicating parsing issues), both `date` and `time_only` columns are set to `None`. This precaution ensures the dataset's integrity remains uncompromised.
- Previewing Changes: Finally, we display the first few rows of the newly created `date` and `time_only` columns to verify the split was successful and to provide a glimpse into the data structure post-transformation.
This enhancement is pivotal for conducting in-depth temporal analysis and offers the flexibility to analyze trends and patterns based on specific dates or times.
# Splitting the 'time' column into separate 'date' and 'time' columns, handling any potential issues
data['time'] = pd.to_datetime(data['time'], errors='coerce')  # Parsing the datetime to ensure correct format

# If datetime parsing was successful, create separate 'date' and 'time' columns
if not data['time'].isnull().all():
    data['date'] = data['time'].dt.date
    data['time_only'] = data['time'].dt.time
else:
    data['date'], data['time_only'] = [None, None]

placeholder_output_structure_2 = data[['date', 'time_only']].head(3)
placeholder_output_structure_2
| date | time_only | |
|---|---|---|
| 0 | 2021-06-29 | 21:18:25 |
| 1 | 2021-06-30 | 12:17:40 |
| 2 | 2021-07-01 | 00:00:21 |
2.4 Identifying Unique Sensor Serial Numbers¶
An essential step in our data analysis process involves identifying the unique serial numbers of the sensors within our dataset. This enables us to differentiate and potentially name each sensor for more detailed and personalized analysis. Here's the procedure:
- Retrieve Unique Serials: We use `data['serial'].unique()` to extract an array of unique serial numbers from the `serial` column of our dataset. The `.unique()` method is instrumental in identifying all distinct sensors present, ensuring no duplicates are considered.
- Displaying Unique Serials: The resulting array of unique serial numbers is then displayed. This array is crucial for subsequent steps in our analysis, such as naming sensors, analyzing data on a per-sensor basis, or grouping data for sensor-specific insights.
This step is fundamental in understanding the diversity and range of sensors within our dataset, laying the groundwork for targeted sensor-specific analyses and operations.
missing_values_count = data['sensor_name'].isnull().sum()
print(f"Number of missing values in 'sensor_name': {missing_values_count}")
Number of missing values in 'sensor_name': 2889
unique_serial_numbers = data['serial'].unique()
unique_serial_numbers
<StringArray>
['FB1000001905AA05', 'FB1000001905AA15', 'FB1000001905AA17', 'FB1000001905AA07',
 'FB1000001905AA18', 'FB1000001905AA19', 'FB1000001905AA11', 'FB1000001905AA13',
 'FB1000001905AA02', 'FB1000001905AA01', 'FB1000001905AA08', 'FB1000001905AA04',
 'FB1000001905AA00', 'FB1000001905AA12', 'FB1000001905AA14', 'FB1000001905AA16',
 'FB1000001905AA09', 'FB1000001905AA03', 'FB1000001905AA06', 'FB1000001905AA10']
Length: 20, dtype: string
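The per-sensor grouping mentioned above can be sketched as follows. The tiny frame here is hypothetical, standing in for the real dataset, and is not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample standing in for the real sensor data
sample = pd.DataFrame({
    "serial": ["AA05", "AA15", "AA05", "AA15"],
    "status_current_fill_level": [14, 49, 30, 61],
})

# Average fill level per sensor serial
mean_fill = sample.groupby("serial")["status_current_fill_level"].mean()
print(mean_fill)  # AA05 -> 22.0, AA15 -> 55.0
```

The same `groupby` pattern applies unchanged to the full `data` frame once the serials are known.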
2.5 Naming Sensors Based on Serial Numbers¶
To enhance clarity and facilitate easier identification of sensors in our dataset, we adopt a naming convention that incorporates the last two digits of each sensor's serial number. This step not only simplifies the analysis but also adds a layer of readability when dealing with multiple sensors. Here's how we accomplish this:
- Update Sensor Names: We prefix every sensor name with "Sensor-" and append the last two digits of its serial number using `data['serial'].str[-2:]`. This approach ensures each sensor has a unique and identifiable name based on its serial number.
- Reflecting Changes: After updating the `sensor_name` column, we create a placeholder DataFrame to preview the changes. This DataFrame includes both `serial` and `sensor_name` columns, displaying how the sensor names have been updated based on the serial numbers.
- Preview Output: Using `.head()` on our placeholder DataFrame, we show the first few rows to verify the naming convention's application and ensure data integrity is maintained.
This naming strategy not only aids in the quick identification of sensors but also adds a systematic approach to managing sensor data, making subsequent data analysis tasks more intuitive.
data['sensor_name'] = "Sensor-" + data['serial'].str[-2:]
placeholder_output_structure_1 = data[['serial', 'sensor_name']]
placeholder_output_structure_1.head(3)
| serial | sensor_name | |
|---|---|---|
| 0 | FB1000001905AA05 | Sensor-05 |
| 1 | FB1000001905AA15 | Sensor-15 |
| 2 | FB1000001905AA17 | Sensor-17 |
missing_values_count = data['sensor_name'].isnull().sum()
print(f"Number of missing values in 'sensor_name': {missing_values_count}")
Number of missing values in 'sensor_name': 0
2.6 Identifying Missing Values in the Dataset (part 2)¶
A crucial step in data preprocessing involves identifying and quantifying missing values within our dataset. This process enables us to understand the extent of missing data and strategize on how to handle it. We approach this in two main steps:
- Identifying Missing Values: We utilize `data.isnull().sum()` to calculate the total number of missing values in each column of our dataset. This method checks each cell for null values, providing a column-wise sum. The output is printed under "Current Missing Values," offering immediate visibility into which columns contain missing data and to what extent.
- Calculating Missing Data Proportion: To grasp the significance of the missing values, we calculate their proportion relative to the total dataset size. This is achieved by dividing the number of missing values in each column by the total number of rows in the dataset, then multiplying by 100 to convert it to a percentage. The results are printed under "Current Missing Values Proportion," giving us a clear picture of how much data is missing across each column.
This assessment not only highlights the areas of the dataset that may require attention due to missing data but also informs our decision-making process regarding data cleaning techniques, such as imputation or removal, ensuring the robustness of our analysis.
missing_values = data.isnull().sum()
print("Current Missing Values")
print(missing_values)
# Calculation of the missing proportion
missing_proportion = (missing_values / len(data)) * 100
print("Current Missing Values Proportion:")
print(missing_proportion)
Current Missing Values
time                            0
serial                          0
sensor_name                     0
status_current_fill_level       0
bin_status                      0
last_update                  9999
product_type                    0
description                     0
address                         0
latlong                      1808
wastebasket_size             9999
date                            0
time_only                       0
dtype: int64
Current Missing Values Proportion:
time                           0.000000
serial                         0.000000
sensor_name                    0.000000
status_current_fill_level      0.000000
bin_status                     0.000000
last_update                  100.000000
product_type                   0.000000
description                    0.000000
address                        0.000000
latlong                       18.081808
wastebasket_size             100.000000
date                           0.000000
time_only                      0.000000
dtype: float64
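Since `last_update` and `wastebasket_size` are 100% missing, a common cleaning strategy is to drop any column whose missing proportion exceeds a threshold. A minimal sketch, using a synthetic frame and a hypothetical 50% threshold (not from the original notebook):

```python
import numpy as np
import pandas as pd

# Synthetic frame: one column entirely missing, one partially missing
df = pd.DataFrame({
    "serial": ["A", "B", "C", "D"],
    "last_update": [np.nan] * 4,         # 100% missing
    "latlong": ["x", np.nan, "y", "z"],  # 25% missing
})

# Keep only columns whose missing proportion is at or below the threshold
threshold = 0.5
keep = df.columns[df.isnull().mean() <= threshold]
df = df[keep]
print(list(df.columns))  # ['serial', 'latlong']
```

`df.isnull().mean()` gives the missing fraction per column directly, making the rule a one-liner.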
2.7 Cleaning the Dataset by Handling Missing Data¶
A vital part of preparing our dataset for analysis involves ensuring that it does not contain significant gaps that could distort our findings. To this end, we take two decisive actions to handle missing data effectively:
- Dropping Rows with Missing `latlong` Data: Recognizing the critical nature of geographical data for our analysis, we remove any rows where the `latlong` information is missing. This is achieved with `data.dropna(subset=['latlong'])`, ensuring that our dataset only includes entries with valid geolocation data. This step is crucial for analyses that depend on spatial information.
- Removing Unnecessary Columns: Further refining our dataset, we decide to eliminate columns that are not essential to our specific analytical goals. In this case, the `last_update` and `wastebasket_size` columns are removed using `data.drop(columns=['last_update', 'wastebasket_size'])`. This decision might be driven by factors such as the columns' relevance to the research question, a high proportion of missing values, or the desire to streamline the dataset for efficiency.
By taking these steps, we ensure our dataset is more focused and manageable, eliminating potential sources of error or bias that could arise from missing or irrelevant data. This process sets a strong foundation for accurate and meaningful analysis.
#Dropping rows with missing 'latlong'
data = data.dropna(subset=['latlong'])
data = data.drop(columns=['last_update', 'wastebasket_size'])
2.8 Updating the new 'date' Column Data Type¶
To ensure our dataset's date column is in the optimal format for date-related operations and analysis, we convert it to a datetime data type. This conversion is crucial for several reasons:
- Consistency and Accuracy: Converting the `date` column to a datetime format using `pd.to_datetime(data['date'])` ensures that all entries are consistently recognized as dates. This uniformity is vital for accurate sorting, filtering, and time series analysis.
- Enhanced Date Functionality: With the `date` column in datetime format, we unlock a range of Pandas functionalities specific to date and time operations. This includes extracting components of the date (like the month or year), performing date arithmetic, and efficiently handling time zones if necessary.
This step is a fundamental part of data preprocessing, setting the stage for any analysis or operations that rely on temporal data. It ensures our dataset is accurately prepared to support insightful and reliable temporal analyses.
data['date'] = pd.to_datetime(data['date'])
# Parsing the datetime to ensure correct format
# data['time_only'] = pd.to_datetime(data['time_only'], errors='coerce')
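The date functionality unlocked by the conversion can be illustrated with a short sketch on made-up dates (not notebook code):

```python
import pandas as pd

# Hypothetical dates in the same format as the 'date' column
dates = pd.to_datetime(pd.Series(["2021-06-29", "2021-07-01"]))

# Component extraction via the .dt accessor
print(dates.dt.year.tolist())   # [2021, 2021]
print(dates.dt.month.tolist())  # [6, 7]

# Date arithmetic: number of days between the two entries
print((dates.iloc[1] - dates.iloc[0]).days)  # 2
```

None of this works on plain strings, which is why the dtype conversion matters.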
2.9 Re-checking Current Data Types in the Dataset¶
After performing various data preprocessing steps, including handling missing values and optimizing data types, it's important to review the current state of our dataset's structure. This involves checking the data types of each column to ensure they align with our expectations and requirements for analysis. Here's how we do it:
- Print Statement: We use a simple print statement to introduce the context, stating "Current data types".
- Displaying Data Types: The `data.dtypes` command is executed to list each column's data type within our DataFrame. This output is crucial for verifying that our data preprocessing steps have been successfully applied, ensuring that each column is now properly formatted for efficient and accurate analysis.
By reviewing the data types, we confirm the dataset's readiness for further analysis, ensuring that numerical data can be correctly manipulated and that categorical and date-time information is appropriately formatted.
print("Current data types")
print(data.dtypes)
Current data types
time                         datetime64[ns, UTC]
serial                            string[python]
sensor_name                       string[python]
status_current_fill_level                  int64
bin_status                              category
product_type                            category
description                       string[python]
address                           string[python]
latlong                                   object
date                              datetime64[ns]
time_only                                 object
dtype: object
2.10 Analyzing 'latlong' Column for Completeness¶
Ensuring the completeness and accuracy of geolocation data in our 'latlong' column is crucial for spatial analysis. To this end, we conduct a two-fold examination to identify any missing or blank entries:
- Missing Values Check: We calculate the total number of missing (NaN) values in the 'latlong' column using `data['latlong'].isna().sum()`. This step helps us understand if there are any entries that lack geolocation data entirely, which could impact geospatial analyses and visualizations.
- Blank Cells Check: To identify entries that might appear non-empty but contain only blank spaces, we use `data['latlong'].str.strip().eq('').sum()`. This process strips any leading and trailing spaces from each entry before comparing it to an empty string, allowing us to count how many cells are effectively blank despite not being NaN.
The results of these checks are printed, providing insight into the data quality of the 'latlong' column. Identifying and quantifying both missing and blank values are essential steps in data cleaning, ensuring that subsequent spatial analyses are based on complete and accurate data.
# Check for missing values in 'latlong' column
missing_values_count = data['latlong'].isna().sum()
print(f"Missing values in 'latlong': {missing_values_count}")
# Check for blank cells in 'latlong' column
blank_cells_count = data['latlong'].str.strip().eq('').sum()
print(f"Blank cells in 'latlong': {blank_cells_count}")
Missing values in 'latlong': 0
Blank cells in 'latlong': 0
2.11 Inspecting a Non-Null 'latlong' Entry¶
To better understand the format and structure of our geolocation data within the 'latlong' column, we inspect a sample non-null entry. This examination is pivotal for planning any required data transformation or extraction steps, especially when preparing for spatial analysis. Here's our approach:
- Filtering Non-Null Entries: We start by removing all null entries from the 'latlong' column using `data['latlong'].dropna()`, ensuring we're only working with valid geolocation data.
- Sample Inspection: If the filtered series is not empty, indicating the presence of non-null geolocation data, we print the first non-null entry using `.iloc[0]`. This provides us with a clear example of the data format we're dealing with.
- Handling Absence of Data: In cases where no non-null entries exist within the 'latlong' column, we print a message stating "No non-null entries found in 'latlong'." This helps in identifying datasets that may lack the necessary geolocation data for spatial analysis.
Inspecting a sample entry allows us to understand the formatting of our geolocation data, informing any necessary preprocessing steps to extract latitude and longitude for analysis purposes.
# Print the structure of a non-null 'latlong' entry
non_null_entries = data['latlong'].dropna()
if not non_null_entries.empty:
    print(non_null_entries.iloc[0])
else:
    print("No non-null entries found in 'latlong'.")
{'lon': 144.966267, 'lat': -37.802983}
2.12 Extracting Latitude and Longitude¶
In our dataset, geolocation data within the 'latlong' column is stored in a dictionary format, necessitating the extraction of latitude and longitude into separate columns for more straightforward spatial analysis. The steps we take are as follows:
- Latitude Extraction: We create a new column, 'latitude', by applying a function to each entry in the 'latlong' column. This function checks if the entry is a dictionary and, if so, retrieves the value associated with the 'lat' key. If the entry isn't a dictionary, it assigns `None`.
- Longitude Extraction: Similarly, we generate a 'longitude' column by applying a function that extracts the value associated with the 'lon' key from each dictionary entry in the 'latlong' column. Entries that do not conform to the expected dictionary format are assigned `None`.
- Verification: To ensure our extraction process was successful, we display the first few rows of the new 'latitude' and 'longitude' columns using `.head()`. This step is crucial for verifying that the extraction process has correctly populated these new columns with the intended data.
.head(). This step is crucial for verifying that the extraction process has correctly populated these new columns with the intended data.
This extraction method allows us to separate and utilize the geolocation data efficiently, laying the groundwork for detailed spatial analysis and visualization.
# Extract 'latitude' and 'longitude' from the dictionary in the 'latlong' column, using the correct keys
data['latitude'] = data['latlong'].apply(lambda x: x.get('lat') if isinstance(x, dict) else None)
data['longitude'] = data['latlong'].apply(lambda x: x.get('lon') if isinstance(x, dict) else None)
# Display the first few rows to verify the new columns
print(data[['latitude', 'longitude']].head())
    latitude   longitude
0 -37.802983  144.966267
1 -37.802017  144.965500
2 -37.803217  144.966100
3 -37.803117  144.965233
5 -37.802233  144.966500
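With latitude and longitude separated, bin-to-bin distances can feed the route planning in section 3. The sketch below uses the standard haversine approximation on the first two bins shown above; the helper function is ours, not from the notebook:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Distance between the first two bins in the output above
d = haversine_km(-37.802983, 144.966267, -37.802017, 144.965500)
print(f"{d * 1000:.0f} m")  # roughly 127 m apart
```

A full route optimizer would compute this pairwise for all bins to build the distance matrix a solver consumes.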
2.13 Confirming Data Preprocessing with a New DataFrame¶
After meticulous data cleaning and preprocessing, including handling missing values, optimizing data types, and extracting critical geolocation information, we consolidate our efforts into a new DataFrame. This new DataFrame serves as a snapshot to confirm the effective application of our preprocessing steps. The process involves the following:
- Selection of Key Columns: We create `new_DF` by selecting columns that are essential for our subsequent analyses: 'date', 'time_only', 'serial', 'sensor_name', 'status_current_fill_level', 'latitude', and 'longitude'. This selection not only reflects the data cleaning and transformation efforts undertaken but also aligns with our analysis objectives by focusing on crucial data points.
- DataFrame Creation: The new DataFrame, `new_DF`, is instantiated with the chosen columns, providing a streamlined and focused dataset for analysis. This step is crucial for verifying that our preprocessing tasks have been successfully applied and that the dataset is now in a suitable format for analysis.
- Review and Verification: Displaying `new_DF` allows us to visually confirm the presence and correct formatting of the selected columns, ensuring that our dataset is ready for the next stages of our project.
This consolidation into a new DataFrame is a critical step in our data preparation process, enabling us to proceed with confidence in the quality and relevance of our data for analysis.
new_DF = data[['date', 'time_only', 'serial', 'sensor_name', 'status_current_fill_level', 'latitude', 'longitude']]
new_DF.head(3)
| date | time_only | serial | sensor_name | status_current_fill_level | latitude | longitude | |
|---|---|---|---|---|---|---|---|
| 0 | 2021-06-29 | 21:18:25 | FB1000001905AA05 | Sensor-05 | 14 | -37.802983 | 144.966267 |
| 1 | 2021-06-30 | 12:17:40 | FB1000001905AA15 | Sensor-15 | 49 | -37.802017 | 144.965500 |
| 2 | 2021-07-01 | 00:00:21 | FB1000001905AA17 | Sensor-17 | 100 | -37.803217 | 144.966100 |
2.14 Downloading the Cleaned Data¶
This script exports the preprocessed dataset so it can be loaded into Power BI for dashboarding and richer visualization. To run the code, remove the hash ('#') symbols that comment it out and adjust the file path so it points to a valid location on your machine. Doing so streamlines the setup, enabling a smooth handoff into the Power BI platform for data analysis and visualization.
# Specify the path and file name for the CSV file you want to create
#file_path = "C:/Users/dpmdj/Downloads/Pre_processed_Smart_bin_data.csv" # Adjust the path as needed
# Export the DataFrame to a CSV file
#new_DF.to_csv(file_path, index=False)
# If you need to provide a link to download the file (in a Jupyter environment, for example)
#print(f"Data exported to {file_path}")
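As a portable alternative to the hard-coded Windows path above, the export can target the system temporary directory via `pathlib`. This is a sketch with a hypothetical stand-in frame; in the notebook you would pass `new_DF` instead:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for new_DF
frame = pd.DataFrame({"serial": ["AA05"], "status_current_fill_level": [14]})

# Write to the system temp directory instead of a user-specific path
out_path = Path(tempfile.gettempdir()) / "Pre_processed_Smart_bin_data.csv"
frame.to_csv(out_path, index=False)
print(f"Data exported to {out_path}")

# Round-trip check: the file reads back identically
assert pd.read_csv(out_path).equals(frame)
```

Using `tempfile.gettempdir()` keeps the cell runnable on any operating system without editing the path first.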
3.0 Visualization Part 1 (Plotting the map and finding the best route)¶
3.1 Visualizing Geolocation Data with Folium¶
In our project, we leverage the Folium library to create interactive maps that can visualize the geolocation data we've prepared. The first step involves installing Folium, a powerful Python library used for visualizing geospatial data. Here’s how we proceed:
- Installation: We begin by installing Folium using `pip install folium`, ensuring that we have access to its comprehensive mapping capabilities.
- Importing Folium: With Folium installed, we import it into our notebook to start creating maps.
- Setting Coordinates: We define the coordinates for Argyle Square, which will serve as the focal point of our map. The latitude (`-37.8036`) and longitude (`144.9655`) values are set to center the map precisely on Argyle Square.
- Creating the Map: We instantiate a Folium map object using `folium.Map()`, passing in the coordinates of Argyle Square along with a `zoom_start` value of 17 to provide a detailed view of the area. The `tiles` parameter is set to 'CartoDB positron' for a clean and modern map aesthetic.
This step is crucial for visualizing the spatial distribution of our data points, offering insights into patterns and trends that may not be apparent through numerical analysis alone. The interactive nature of Folium maps further enhances our ability to explore and interpret the data effectively.
pip install folium
Requirement already satisfied: folium in c:\users\dpmdj\anaconda3\lib\site-packages (0.16.0)
Requirement already satisfied: branca>=0.6.0 in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (0.7.1)
Requirement already satisfied: jinja2>=2.9 in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (3.1.3)
Requirement already satisfied: numpy in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (1.26.4)
Requirement already satisfied: requests in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (2.31.0)
Requirement already satisfied: xyzservices in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (2022.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\dpmdj\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (2024.2.2)
Note: you may need to restart the kernel to use updated packages.
# Create the Map Centered on Argyle Square
import folium
# Coordinates for Argyle Square
argyle_square_lat = -37.8036
argyle_square_lon = 144.9655
# Create a map centered around Argyle Square with a suitable zoom level
map = folium.Map(location=[argyle_square_lat, argyle_square_lon], zoom_start=17, tiles='CartoDB positron')
3.2 Adding Markers to the Map for Each Location¶
With our Folium map centered on Argyle Square, the next step enhances our spatial analysis by marking the precise locations of interest within our dataset. Here’s how we systematically add markers to our map:
Iterating Through Data: We loop through each row in our new_DF DataFrame, which contains the cleaned and preprocessed data, including the essential 'latitude' and 'longitude' columns for our geolocation points.
Marker Creation: For each row, we create a Folium marker using folium.Marker(), set to the geographical coordinates found in the 'latitude' and 'longitude' columns of that row.
Popup Information: To each marker we add a popup that displays the latitude and longitude of the location. The popup is activated when the marker is clicked, providing quick access to precise location data.
Adding Markers to the Map: Each marker is then added to our previously created Folium map object with .add_to(map), placing it on the map at the designated coordinates.
By marking each location on the map, we visually represent our dataset's geospatial distribution. This method not only aids in identifying patterns or clusters in the data but also facilitates a more interactive and engaging analysis, allowing users to explore individual data points through their geographic context.
# Add Markers for All Latitude and Longitude Values
for index, row in new_DF.iterrows():
    folium.Marker(
        [row['latitude'], row['longitude']],
        popup=f"Lat: {row['latitude']}, Lon: {row['longitude']}"
    ).add_to(map)
3.3 Visualizing Sensor Locations on a Map¶
To further enhance our spatial analysis, we create a simplified map centered around Argyle Square, using it as a base to display the locations of various sensors. This visual representation is key to understanding the geographical distribution of sensors and assessing any spatial patterns. Here's the step-by-step process:
Map Creation: We start by generating a new Folium map object, Bin_location_map, centered on Argyle Square. The map is initialized with a zoom level of 17, providing a close-up view of the area for detailed inspection.
Adding Markers: We iterate over our dataset, adding a marker for each sensor's location. Each marker is configured with the sensor's latitude and longitude, ensuring accurate placement on the map.
Popup Feature: For added interactivity, each marker is equipped with a popup that displays the sensor's latitude and longitude, so users can click any marker to reveal the exact coordinates of the sensor.
Displaying the Map: Finally, the Bin_location_map object is displayed, showing all sensor locations as markers. In a Jupyter notebook, the map renders automatically within the notebook interface, providing an immediate and interactive visual analysis tool.
By incorporating these steps, we create a comprehensive spatial visualization that allows for an intuitive exploration of sensor data, aiding in the identification of patterns, clusters, or gaps in sensor coverage.
# Create a map centered around the average location
Bin_location_map = folium.Map(location=[argyle_square_lat, argyle_square_lon], zoom_start=17)
# Loop through the DataFrame and add each bin location as a marker on the map
for index, row in new_DF.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=f"Lat: {row['latitude']}, Lon: {row['longitude']}"
    ).add_to(Bin_location_map)
# To display the map in a Jupyter notebook
Bin_location_map
missing_values_count = data['sensor_name'].isnull().sum()
print(f"Number of missing values in sensor_name: {missing_values_count}")
Number of missing values in sensor_name: 0
4.0 Visualization Part 2 (Finding the best route)¶
4.1 Optimizing Operational Efficiency through Data Normalization¶
In our pursuit of operational excellence, particularly in optimizing collection routes based on bin fill levels, it's imperative to standardize the data we base our decisions on. The process of normalizing the fill levels of bins to a uniform scale is a critical step in this journey. Here’s how and why we do it:
4.2 Step-by-Step Approach:¶
Normalization Process: We convert the status_current_fill_level values in our new_DF DataFrame to a standardized scale ranging from 0 to 1. This is done by dividing each value by the maximum possible fill level, which is 100. This adjustment is encapsulated in the following line of code: new_DF['status_current_fill_level'] = new_DF['status_current_fill_level'] / 100
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
# Normalize fill_level if it's not already (max fill_level is 100 )
#new_DF['status_current_fill_level'] = new_DF['status_current_fill_level'] / 100
4.3 Calculating the Pairwise Distance Matrix¶
A critical step in optimizing routes and analyzing spatial relationships within our dataset is calculating the pairwise distance matrix. This matrix helps us understand the distances between all pairs of points (in this case, bins) based on their geographic coordinates. Here’s a detailed breakdown of the process:
4.4 Steps and Code:¶
Coordinate Extraction: We start by extracting the latitude and longitude values from our new_DF DataFrame into a new variable, coords. This subset contains just the essential geographic coordinates for each location: coords = new_DF[['latitude', 'longitude']]
Distance Matrix Calculation: Utilizing the cdist function from SciPy's spatial distance library, we compute the Euclidean distance between each pair of points in coords. The result, dist_matrix, is a matrix where each element represents the distance between a pair of locations.
# Calculate pairwise distance matrix
coords = new_DF[['latitude', 'longitude']]
dist_matrix = cdist(coords, coords, metric='euclidean')
print(dist_matrix)
[[0. 0.00123347 0.00028748 ... 0.00078536 0.00058005 0.00062876] [0.00123347 0. 0.00134164 ... 0.00102306 0.00067859 0.00184459] [0.00028748 0.00134164 0. ... 0.00106219 0.00066339 0.00053151] ... [0.00078536 0.00102306 0.00106219 ... 0. 0.00078355 0.001334 ] [0.00058005 0.00067859 0.00066339 ... 0.00078355 0. 0.00117004] [0.00062876 0.00184459 0.00053151 ... 0.001334 0.00117004 0. ]]
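Note that cdist with metric='euclidean' treats latitude/longitude degrees as planar coordinates, a reasonable shortcut at neighbourhood scale but not a true ground distance. As a hedged alternative (the function name and sample points are illustrative), a great-circle distance matrix in metres can be computed with the haversine formula:

```python
import numpy as np

def haversine_matrix(coords_deg):
    """Pairwise great-circle distances in metres for an (n, 2) array of [lat, lon] degrees."""
    R = 6371000.0  # mean Earth radius in metres
    lat = np.radians(coords_deg[:, 0])[:, None]
    lon = np.radians(coords_deg[:, 1])[:, None]
    dlat = lat - lat.T          # (n, n) latitude differences via broadcasting
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# Two sample bin locations near Argyle Square (from the dataset preview above)
pts = np.array([[-37.802017, 144.965500], [-37.803217, 144.966100]])
dist_m = haversine_matrix(pts)
```

The resulting matrix can be dropped in wherever dist_matrix is used, with the benefit that values are in metres rather than degrees.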
4.5 Route Optimization Based on Fill Levels and Proximity¶
To enhance operational efficiency in servicing bins, we employ a strategy that prioritizes bins with higher fill levels and minimizes travel distance. This two-pronged approach ensures that the most critical bins are serviced first while also optimizing the route taken to reduce time and resources spent. Here’s how we implement this strategy:
4.6 Prioritizing by Fill Level:¶
- Sorting Locations: We begin by sorting the locations based on their fill levels in descending order. This is achieved by using np.argsort on the negative values of the status_current_fill_level column in new_DF. The result, priority_indices, represents the indices of locations sorted by their priority, with higher fill levels first.
# Sort locations by descending fill level (prioritize higher fill levels)
# .to_numpy() yields positional indices, independent of the DataFrame's index labels
priority_indices = np.argsort(-new_DF['status_current_fill_level'].to_numpy())
4.7 Route Optimization:¶
- Initializing the Route: The route list is initialized with the starting point, chosen as the first location in our prioritized list. This represents the location with the highest fill level.
# Initialize route list with the starting point (assuming first location as start)
route = [priority_indices[0]]
4.8 Determine Next Stop:¶
To determine the next stop, we follow a simple heuristic: from the current location, move to the nearest unvisited location. For each subsequent location:
Calculate the distances from the last visited location to all others using the precomputed distance matrix, dist_matrix.
Temporarily set the distance to already visited locations to infinity to exclude them from consideration.
Identify the next stop as the location with the minimum distance to the last visited location. This location is then added to the route.
This methodical approach to route optimization, prioritizing bins by their necessity for service (based on fill levels) and sequentially choosing the next nearest location, significantly enhances efficiency. It not only ensures the most critical bins are attended to promptly but also minimizes the operational costs associated with travel distances.
for _ in range(1, len(priority_indices)):
    last_visited = route[-1]
    # Copy the row: marking visited stops with np.inf must not corrupt dist_matrix itself
    distances_to_last_visited = dist_matrix[last_visited].copy()
    # Filter out already visited locations
    distances_to_last_visited[route] = np.inf
    next_stop = np.argmin(distances_to_last_visited)
    route.append(next_stop)
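The greedy nearest-neighbour loop can be exercised on a toy distance matrix to verify its behaviour (the four bins and their fill levels below are invented purely for illustration; note the row copy, which keeps the distance matrix intact across iterations):

```python
import numpy as np

# Toy symmetric distance matrix for four bins (indices 0..3)
dist_matrix = np.array([
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 1.5],
    [3.0, 5.0, 1.5, 0.0],
])
fill_levels = np.array([60, 90, 40, 70])  # bin 1 is fullest, so it starts the route

priority_indices = np.argsort(-fill_levels)
route = [priority_indices[0]]
for _ in range(1, len(priority_indices)):
    # Copy the row so the np.inf markers do not corrupt dist_matrix itself
    distances = dist_matrix[route[-1]].copy()
    distances[route] = np.inf  # exclude already visited bins
    route.append(int(np.argmin(distances)))

print(route)  # starts at bin 1, then hops to the nearest unvisited bin each step
```

On this data the route is bin 1, then 0 (distance 1.0), then 3 (3.0), then 2 (1.5): the fullest bin is serviced first and each subsequent hop is locally shortest.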
4.9 Visualization of the Optimized Route¶
To further refine our visualization and provide a comprehensive view of the optimized route, we employ Folium to map out each stop with markers and visually connect these stops with lines. This approach not only highlights the individual locations but also delineates the sequence and path of the route, offering a holistic view of the journey. Here's the step-by-step process:
4.10 Setting Up the Map:¶
- Centering the Map: We calculate the central point of our route by averaging the latitude and longitude of all stops. This ensures the map is optimally centered, making all route points easily viewable.
# Visualization
map_center = [new_DF['latitude'].mean(), new_DF['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=20)
4.11 Marking Stops:¶
- Placing Markers: Each location along the route is marked on the map with a blue marker. These markers are annotated with popups that display the fill level percentage, providing immediate insight into each stop's priority.
# Add markers for each location in the route
for idx in route:
    folium.Marker([new_DF.iloc[idx]['latitude'], new_DF.iloc[idx]['longitude']],
                  popup=f"Fill Level: {new_DF.iloc[idx]['status_current_fill_level']}%",
                  icon=folium.Icon(color="blue", icon="info-sign")).add_to(m)
# Add a marker for the starting point
start_idx = route[0]
folium.Marker(
[new_DF.iloc[start_idx]['latitude'], new_DF.iloc[start_idx]['longitude']],
popup="Start",
icon=folium.Icon(color="purple", icon="play")
).add_to(m)
# Add a marker for the end point
end_idx = route[-1]
folium.Marker(
[new_DF.iloc[end_idx]['latitude'], new_DF.iloc[end_idx]['longitude']],
popup="End",
icon=folium.Icon(color="blue", icon="stop")
).add_to(m)
def get_marker_color(fill_level):
    """Determine the marker color based on fill level."""
    if fill_level <= 30:
        return "green"
    elif fill_level <= 70:
        return "orange"
    else:
        return "red"

for idx in route:
    fill_level = new_DF.iloc[idx]['status_current_fill_level']
    folium.Marker(
        [new_DF.iloc[idx]['latitude'], new_DF.iloc[idx]['longitude']],
        popup=f"Fill Level: {fill_level}%",
        icon=folium.Icon(color=get_marker_color(fill_level), icon="info-sign")
    ).add_to(m)
4.12 Illustrating the Route:¶
- Connecting with Lines: To visualize the travel path, we draw red lines between consecutive stops along the route. This is achieved by iterating through the route and drawing a polyline between each pair of locations.
# Display route as lines
for i in range(len(route) - 1):
    start_pos = [new_DF.iloc[route[i]]['latitude'], new_DF.iloc[route[i]]['longitude']]
    end_pos = [new_DF.iloc[route[i+1]]['latitude'], new_DF.iloc[route[i+1]]['longitude']]
    folium.PolyLine([start_pos, end_pos], color='red').add_to(m)
# Save the map to an HTML file
m.save('optimized_route_map.html')
# To display the map in a Jupyter notebook
m
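As a final sanity check (a sketch, not part of the original notebook), the total length of a route can be summed from the distance matrix, which makes it easy to compare the optimized ordering against a naive one. The helper name and toy matrix below are illustrative:

```python
import numpy as np

def route_length(route, dist_matrix):
    """Sum of leg distances along a route given as a list of location indices."""
    return sum(dist_matrix[a, b] for a, b in zip(route[:-1], route[1:]))

# Toy matrix: the nearest-neighbour order 0 -> 1 -> 2 beats the order 0 -> 2 -> 1
dm = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
])
print(route_length([0, 1, 2], dm))  # 3.0
print(route_length([0, 2, 1], dm))  # 6.0
```

Applied to the real route and dist_matrix, this single number gives a quick measure of how much travel the greedy heuristic saves.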